NSF PAR Search | NSF Public Access Repository


One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems

Lu, Ruiming; Lu, Yunchi; Jiang, Yuxuan; Xue, Guangtao; Huang, Peng (April 2025, 22nd USENIX Symposium on Networked Systems Design and Implementation)

Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain poorly understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics of slow faults. To address this gap, we design ADR, a lightweight library that is used within system code to make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.
Free, publicly-accessible full text available April 28, 2026
Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems

Lou, Chang; Parikesit, Dimas Shidqi; Huang, Yujin; Yang, Zhewen; Diwangkara, Senapati; Jing, Yuzhuo; Kistijantoro, Achmad Imam; Yuan, Ding; Nath, Suman; Huang, Peng (July 2025, 19th USENIX Symposium on Operating Systems Design and Implementation)

Production distributed systems provide rich features, but various defects can cause a system to silently violate its semantics without explicit errors. Such failures cause serious consequences. Yet they are extremely challenging to detect, as writing good checkers requires deep domain knowledge and substantial manual effort. In this paper, we explore a novel approach that directly derives semantic checkers from system test code. We first present a large-scale study of existing system test cases. Guided by the study findings, we develop T2C, a framework that uses static and dynamic analysis to transform and generalize a test into a runtime checker. We apply T2C to four large, popular distributed systems and successfully derive tens to hundreds of checkers. These checkers detect 15 out of 20 real-world silent failures we reproduce and incur small runtime overhead.
Free, publicly-accessible full text available July 7, 2026
Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection

https://doi.org/10.1145/3694715.3695979

Pan, Jia; Wu, Haoze; Leesatapornwongsa, Tanakorn; Nath, Suman; Huang, Peng (November 2024, ACM)

Debugging a failure usually requires reproducing it first. This can be hard for failures in production distributed systems, where bugs are exposed only by some unusual faulty events. Although fault injection testing has become popular, existing solutions are designed for bug finding and are ineffective and inefficient at reproducing a specific failure during debugging. We explore a new type of fault injection technique for quickly reproducing a given fault-induced production failure in distributed systems. We present a tool, Anduril, that uses static causal analysis and a novel feedback-driven algorithm to quickly search the enormous fault space for the root-cause fault and timing. We evaluate Anduril on 22 real-world complex fault-induced failures from five large-scale distributed systems. Anduril reproduced all failures by identifying and injecting the root-cause faults at the right time, in a median of 8 minutes.
Full Text Available
Efficient Exposure of Partial Failure Bugs in Distributed Systems with Inferred Abstract States

Wu, Haoze; Pan, Jia; Huang, Peng (April 2024, 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24))

Many distributed system failures, especially the notorious partial service failures, are caused by bugs that are triggered only by subtle faults with rare timing. Existing testing is inefficient at exposing such bugs. This paper presents Legolas, a fault injection testing framework designed to address this gap. To precisely simulate subtle faults, Legolas statically analyzes the system code and instruments hooks within a system. To efficiently explore numerous faults, Legolas introduces a novel notion of abstract states and automatically infers abstract states from code. During testing, Legolas uses an algorithm that leverages the inferred abstract states to make careful fault injection decisions. We applied Legolas to the latest releases of six popular, extensively tested distributed systems. Legolas found 20 new bugs that result in partial service failures.
Full Text Available
Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks

https://doi.org/10.1109/SP54263.2024.00053

Zhang, Xinyu; Hong, Hanbin; Hong, Yuan; Huang, Peng; Wang, Binghui; Ba, Zhongjie; Ren, Kui (May 2024, IEEE)

Full Text Available
Pushing Performance Isolation Boundaries into Application with pBox

https://doi.org/10.1145/3600006.3613159

Hu, Yigong; Huang, Gongqi; Huang, Peng (October 2023, ACM)
Learning to Drive Anywhere

Zhu, Ruizhao; Huang, Peng; Ohn-Bar, Eshed; Saligrama, Venkatesh (November 2023, Conference on Robot Learning)

Human drivers can seamlessly adapt their driving decisions across geographical locations with diverse conditions and rules of the road, e.g., left vs. right-hand traffic. In contrast, existing models for autonomous driving have been thus far only deployed within restricted operational domains, i.e., without accounting for varying driving behaviors across locations or model scalability. In this work, we propose AnyD, a single geographically-aware conditional imitation learning (CIL) model that can efficiently learn from heterogeneous and globally distributed data with dynamic environmental, traffic, and social characteristics. Our key insight is to introduce a high-capacity geo-location-based channel attention mechanism that effectively adapts to local nuances while also flexibly modeling similarities among regions in a data-driven manner. By optimizing a contrastive imitation objective, our proposed approach can efficiently scale across the inherently imbalanced data distributions and location-dependent events. We demonstrate the benefits of our AnyD agent across multiple datasets, cities, and scalable deployment paradigms, i.e., centralized, semi-supervised, and distributed agent training. Specifically, AnyD outperforms CIL baselines by over 14% in open-loop evaluation and 30% in closed-loop testing on CARLA.
Full Text Available
Simplifying Cloud Management with Cloudless Computing

Qiu, Yiming; Kon, Patrick; Xing, Jiarong; Huang, Yibo; Liu, Hongyi; Wang, Xinyu; Huang, Peng; Chowdhury, Mosharaf; Chen, Ang (November 2023, ACM HotNets)

Full Text Available
